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Abstract 

Recently, Bandeira [5] introduced a new type of algorithm (the so-called probably certifiably 
correct algorithm) that combines fast solvers with the optimality certificates provided by convex 
relaxations. In this paper, we devise such an algorithm for the problem of fc-means clustering. 
First, we prove that Peng and Wei’s semidefinite relaxation of A:-means [20] is tight with high 
probability under a distribution of planted clusters called the stochastic ball model. Our proof 
follows from a new dual certificate for integral solutions of this semidefinite program. Next, we 
show how to test the optimality of a proposed fc-means solution using this dual certificate in 
quasilinear time. Finally, we analyze a version of spectral clustering from Peng and Wei [20] 
that is designed to solve A:-means in the case of two clusters. In particular, we show that this 
quasilinear-time method typically recovers planted clusters under the stochastic ball model. 


1 Introduction 

Clustering is a central problem in unsupervised machine learning. It consists of partitioning a given 
finite sequence of points in M™’ into k subsequences such that some dissimilarity function 

is minimized. Usually, this function is chosen ad hoc with an application in mind. A particularly 
common choice is the A:-means objective: 

minimize 
subject to 

Problem (Q is NP-hard in general |13j . A popular heuristic for solving fe-means is Lloyd’s algorithm, 
also known as the /c-means algorithm [15] . This algorithm alternates between calculating centroids 
of proto-clusters and reassigning points according to the nearest centroid. In general, Lloyd’s 
algorithm (and its variants |3l [19|) may converge to local minima of the A:-means objective (e.g., 
see section 5 of |3]). Furthermore, the output of Lloyd’s algorithm does not indicate how far it is 
from optimal. Instead, we seek a new sort of algorithm, recently introduced by Bandeira [5]: 

Definition 1. Let P he an optimization problem that depends on some input, and let D denote a 
probability distribution over possible inputs. Then a probably certifiably correct (PCC) algo¬ 
rithm for (P,D) is an algorithm that on input D ~ D produces a global optimizer o/P with high 
probability, and furthermore produces a certificate of having done so. 

Most non-convex optimization methods fail to produce a certificate of global optimality. How¬ 
ever, if a non-convex problem enjoys a convex relaxation, then solving the dual of this relaxation 
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will produce a certificate of (approximate) optimality. Along these lines, the A:-means problem en¬ 
joys a semidefinite relaxation. To see this, let 1a denote the indicator function ofAC{l,...,A^}, 
and define the N x N matrix D by Dij := \\xi — Xj\\ 2 - Then taking 

( 2 ) 

t=i ' 

the A:-means objective 0 may be expressed as ^Tr(T)X). Since X satishes several convex con¬ 
straints, we may relax the region of optimization to produce a convex program, namely, the Peng- 
Wei semidefinite relaxation |20] (cf. [6]): 

minimize Tr{DX) (3) 

subject to Tr(A) = k, XI = 1, X > 0, A" ^ 0 

Here, A > 0 means that each entry of X is nonnegative, whereas A ^ 0 means that A is symmetric 
and positive semidefinite. 

Recently, it was shown that under a certain random data model, this convex relaxation is 
tight with high probability [1], that is, every solution to the relaxed problem Q has the form Q, 
thereby identifying an optimal clustering. As such, in this high-probability event, one may solve 
the dual program to produce a certificate of optimality. However, semidefinite programming (SDP) 
solvers are notoriously slow. For example, running MATLAB’s built-in implementation of Lloyd’s 
algorithm on 64 points in will take about 0.001 seconds, whereas a CVX implementation HD 
of the dual of ^ for the same data takes about 20 seconds. Also, Lloyd’s algorithm scales much 
better than SDP solvers, and so one should expect this runtime disparity to only increase with larger 
datasets. Overall, while the SDP relaxation theoretically produces a certificate in polynomial time 
(e.g., by an interior-point method [IH]), it is far too slow to wait for in practice. 

As a fast alternative, Bandeira [5] recently devised a general technique to certify global opti¬ 
mality. This technique leverages several components simultaneously: 

(i) A fast non-convex solver that produces the optimal solution with high probability (under 
some reasonable probability distribution of problem instances). 

(ii) A convex relaxation that is tight with high probability (under the same distribution). 

(hi) A fast method of computing a certificate of global optimality for the output of the non-convex 
solver in (i) by exploiting convex duality with the relaxation in (ii). 

In the context of /c-means, one might expect Lloyd’s algorithm and the Peng-Wei SDP to be suitable 
choices for (i) and (ii), respectively. For (hi), one might adapt Bandeira’s original method in [5] 
based on complementary slackness (see Figure for an illustration). In this paper, we provide a 
theoretical basis for each of these components in the context of A:-means. 

1.1 Technical background and overview 

The hrst two components of a probably certihably correct algorithm require non-convex and convex 
solvers that perform well under some “reasonable” distribution of problem instances. In the context 
of geometric clustering, it has become popular recently to consider a particular model of data called 
the stochastic ball model, introduced in |17| : 
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Figure 1: (left) Depiction of complementary slackness. The horizontal axis represents a vector space in which 
we consider a cone program (e.g., a linear or semidehnite program), and the feasibility region of this program is 
highlighted in green. The dual program concerns another vector space, which we represent with the vertical axis 
and feasibility region highlighted in red. The downward-sloping line represents all pairs of points {x, y) that satisfy 
complementary slackness. Recall that when strong duality is satisfied, we have that x is primal-optimal and y is 
dual-optimal if and only if x is primal feasible, y is dual feasible, and {x, y) satisfy complementary slackness. As 
such, the intersection between the blue Cartesian product and the complementary slackness line represents all pairs 
of optimizers, (right) Bandeira’s probably certifiably correct technique [5]. Given a purported primal-optimizer x, 
we hrst check that x is primal-feasible. Next, we select y such that {x,y) satisfies complementary slackness. Finally, 
we check that y is dual-feasible. By complementary slackness, y is then a dual certificate of a:’s optimality in the 
primal program, which can be verified by comparing their values (a la strong duality). 


Definition 2 ((P, 7 , n)-stochastic ball model). Let {7a}a=i ball centers in M™'. For each a, 
draw i.i.d. vectors {ra,i}^=i from some rotation-invariant distribution T> supported on the unit ball. 
The points from cluster a are then taken to be Xa,i ■= ra,i + 7a- We denote A := mina^b ||7a — Tbib- 

Table summarizes the state of the art for recovery guarantees under the stochastic ball model. 
In jlTj, it was shown that an LP relaxation of A:-medians will, with high probability, recover clusters 
drawn from the stochastic ball model provided the smallest distance between ball centers is A > 
3.75. Note that exact recovery only makes sense for A > 2 (i.e., when the balls are disjoint). Once 
A > 4, any two points within a particular cluster are closer to each other than any two points from 
different clusters, and so in this regime, cluster recovery follows from a simple distance thresholding. 
For the A:-means problem, Awasthi et al. [1] studies the Peng~Wei semidehnite relaxation and 
demonstrates exact recovery in the regime A > 2y/2{l + where m is the dimension of the 

Euclidean space. 

As indicated in Table we also study the Peng-Wei SDP, but our guarantee is different from [3]. 
In particular, we demonstrate tightness in the regime A > 2 + k'^/m, which is near-optimal for 
large m. The source of this improvement is a different choice of dual certificate, which leads to the 
following result (see Section]^ for details): 

Theorem 3. Take X of the form and let P/^± denote the orthogonal projection onto the 
orthogonal complement of the span 0 /{1^4^}^,^;^. Then there exists an explicit matrix Z = Z{D,X) 
and scalar z = z{D,X) such that X is a solution to the semidefinite relaxation (© if 

Pf^xZPf^x < zPf^x. (4) 

To prove that A > 2 -\- k"^/m suffices for the SDP to recover the planted clustering under the 
stochastic ball model, we estimate the left- and right-hand sides of ^ with the help of standard 
techniques from random matrix theory and concentration of measure; see Appendix for the 
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Method 


Sufficient Condition Optimal? Reference 


Thresholding A > 4 Yes (simple exercise) 


/c-medians LP 

A > 4 

A > 3.75 

A > 2 

No 

No 

Yes 

Theorem 2 in 
Theorem 1 in |17) 
Theorem 1 in [3] 

A:-means LP 

A > 4 

Yes 

Theorem 9 in [3] 

A:-means SDP 

A > 27 / 2(1 + 1 /^) 

No 

Theorem 3 in [3] 


A > 2 + k‘^/m 

No 

Theorem 
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Spectral /c-means 

A> A*,k = 2 

Yes 

Theorem 

14 



Table 1; Summary of cluster recovery guarantees under the stochastic ball model. The second column reports 
sufficient separation between ball centers in order for the corresponding method to provably give exact recovery 
with high probability. The third column reports whether the sufficient condition on A cannot be improved. Here, 
A* = A*{'D, k) denotes the smallest value for which A > A* implies that minimizing the fe-means objective recovers 
planted clusters under the (P, 7 , n)-stochastic ball model with probability 1 — 


(rather technical) details. While this is an improvement over the condition from [Ij in the large-m 
regime, there are other regimes (e.g., k = m) for which their condition is much better, leaving open 
the question of what the optimal bound is. Conjecture 4 in [3] suggests that A > 2 suffices for the 
/t-means SDP to recover planted clusters under the stochastic ball model, but as we illustrate in 
Section 2.3, this conjecture is surprisingly false. 

Having accomplished component (ii) in Bandeira’s PCC technique, we tackle component (hi) 
next. For this, consider the matrix 


A:=^n^ + P^^ZP^^, (5) 

where z and Z come from Theorem]^ Since the all-ones vector 1 lies in the span of {lAt}t=ii 
have that 1 spans the unique leading eigenspace of A precisely when Pj^±ZPj^± -< zP^-l, which in 
turn implies that A is a fc-means optimal clustering by Theorem]^ As such, component (iii) can 
be accomplished by solving the following fundamental problem from linear algebra: 


Problem 4. Given a symmetric matrix A G and an eigenvector v of A, determine whether the 

span of V is the unique leading eigenspace, that is, the corresponding eigenvalue A has multiplicity 1 
and satisfies |A| > |A'| for every other eigenvalue A' of A. 

Interestingly, this same problem appeared in Bandeira’s original PCC theory [5], but it was left 
unresolved. In this paper, we fill this gap by developing a so-called power iteration detector, which 
applies the power iteration to a random initialization on the unit sphere. Due to the randomness, 
the power iteration produces a test statistic that allows us to infer whether {A, v) satisfies the 
desired leading eigenspace condition. In Section we pose this as a hypothesis test, and we 
estimate the associated error probabilities. In addition, we show how to leverage the structure of 
A defined by ([^ and Theorem to compute the matrix-vector multiplication Ax for any given 
X in only 0{kmN) operations, thereby allowing the test statistic to be computed in linear time 
(up to the spectral gap of A and the desired confidence for the hypothesis test). See Figure]^ for 
an illustration of the runtime of our method. Overall, the power iteration detector will deliver a 
highly confident inference on whether (A, v) satisfies the leading eigenspace condition, which in 
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Figure 2; Take two unit balls in R® at distance 2.3 apart. For each N G {2®, 2^,..., 2^®}, we perform 300 trials 
of the following experiment: Draw N/2 points uniformly at random from each ball, and then compute four different 
functions: (a) MATLAB’s built-in implementation of fe-means-|—b, (b) a CVX implementation [11] of the fc-means 
SDP (c) the power iteration detector (Algorithm with A given by if- and (d) spectral fc-means clustering 
(Algorithm]^. For each trial, we recorded the runtime in seconds. Above, we plot the average runtime along with 
error bars for standard deviation. For the record, the power iteration detector failed to certify optimality (i.e., reject 
Ho in (14l) in at most 3% of the trials with N < 2^, but rejected Ho in every trial otherwise; similarly, spectral 
fc-means failed to recover the planted clusters in two of the trials with A = 2®. Our implementation of the fc-means 
SDP was too slow to perform trials with A > 2^ in a reasonable amount of time, so we only recorded runtimes for 
A G {2®, 2^^, 2®, 2®}. As the plot illustrates, the other algorithms ran in quasilinear time, as expected. 


turn certifies the optimality of X up to the prescribed confidence level. Of course, one may remove 
the need for a confidence level by opting for deterministic spectral methods, but we have no idea 
how to accomplish this in linear or even near-linear time. 

Now that we have discussed components (ii) and (iii) in Bandeira’s PCC technique, we conclude 
by discussing component (i). While we presume that there exists a fast initialization of Lloyd’s 
algorithm that performs well under the stochastic ball model, we leave this investigation for future 
research. Instead, Section considers a spectral method introduced by Peng and Wei in [20] . We 
show that when k = 2, this method performs as well as the optimizer of the original fe-means problem 
under the stochastic ball model. Figure [^illustrates the quasilinear runtime of this approach. 
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1.2 Outline 


In this paper, we provide a theoretical analysis of probably certifiably correct A:-means clustering, 
and we do so by developing components (i), (ii) and (iii) of Bandeira’s general technique. First, 
we investigate (ii) in Section by analyzing the tightness of the Peng-Wei SDP. In particular, we 
choose a different dual certificate from the one used in [3], and our choice demonstrates tightness 
in the SDP for clusters that are near-optimally close. Section then addresses (iii) by providing 
a fast method of computing this dual certificate given the optimal fc-means partition. In fact, 
a subroutine of our method (the so-called power iteration detector) resolves a gap in Bandeira’s 
original PCC theory [5], and as such, we expect this to be leveraged in future PCC algorithms. We 
conclude in Section with some theoretical guarantees for (i). Here, we focus on the case k = 2, 
and we show that a slight modification of the spectral clustering-based method in m manages to 
recover the optimal A:-means partition with high probability under the stochastic ball model. We 
conclude in Section with a discussion of various open problems. 

2 A typically tight relaxation of /c-means 

This section establishes that the Peng-Wei semidefinite relaxation Q of the /c-means problem Q 
is typically tight under the stochastic ball model. First, we find a deterministic condition on the set 
of points under which the relaxation finds the /c-means-optimal solution. Later, we discuss when 
this deterministic condition is satisfied with high probability under the stochastic ball model. 

2.1 The dual program 

The following is the dual program of Q: 

N 

kz + '^^ai ( 6 ) 

i=l 

TV ^ N N ^ 

Q := zl + ^ ai ■ 2 ^ ^ ) + D hO 

i=l 2=1 j=i 

> 0 

For notational simplicity, from this point forward, we organize indices according to clusters. 
For example, la shall denote the indicator function of the ath cluster. Also, we shuffle the rows 
and columns of X and D into blocks that correspond to clusters; for example, the {i,j)ih. entry of 
the (a, 6)th block of D is given by We also index a in terms of clusters; for example, the ith 

entry of the ath block of a is denoted aa^i- For 13, we identify 

N N ^ 

i=l j=i 

Indeed, when i < j, the (z, j)th entry of j3 is /3jj. We also consider j3 as having its rows and columns 
shuffled according to clusters, so that the (i,j)th entry of the (a, 6)th block is I3^'^\ 

With this notation, the following proposition characterizes all possible dual certificates of Q: 

Proposition 5 (Theorem 4 in |12j . cf. a). Take X := X]a=i ^^alj , where Ua denotes the number 
of points in cluster t. The following are equivalent: 


minimize 


subject to 
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(a) X is a solution to the semidefinite relaxation (§. 

(b) Every solution to the dual program satisfies 

g(“-“)l = 0, /?(“’“) =0 Va e 

(c) Every solution to the dual program (© satisfies 

Uar =— — z + -Vo G {1,...,/c}, r G o. 

’ Ua ni Ua 


The following subsection will leverage this result to identify a condition on D that implies that 
the SDP Q relaxation is tight. 


2.2 Selecting a dual certificate 

The goal is to certify when the SDP relaxation is tight. In this event, Proposition characterizes 
acceptable dual certificates {z,a,/3), but this information fails to uniquely determine a certificate. 
In this subsection, we will motivate the application of additional constraints on dual certificates so 
as to identify certifiable instances. 

We start by reviewing the characterization of dual certificates {z, a, j5) provided in Proposition]^ 
In particular, a is completely determined by z, and so z and fi are the only remaining free variables. 
Indeed, for every a, 6 G {1,..., k}, we have 


t=l iGt 


{a,b) 


i£a 

= -l(- + -]z 

^\na Ub, 


^ “hi • + ^eli) 

Y + Z] 

j£b 

+ Y (- —ej 

na J2 


jeb 


and so since 


Q — + E E “h* ■ 2 “ 2^ 

i=l iet 

we may write Q = z{I — E) + M — B, where 


‘2\na UbJ 


:= - —eJ ^e*! 

iea na J2 


jeb 


2j(a,fe) _ lo(a,b) 
2 


(7) 

( 8 ) 
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for every a,b £ {1,... ,k}. The following is one way to formulate our task: Given D and a clustering 
X (which in turn determines E and M), determine whether there exist feasible z and B such that 
Q ^ 0; here, feasibility only requires B to be symmetric with nonnegative entries and = 0 for 

every o G {1, ..., k}. We opt for a slightly more modest goal: Find z = z{D, X) and B = B{D, X) 
such that Q ^ 0 for a large family of D’s. 

Before determining z and B, we first analyze E: 

Lemma 6. Let E be the matrix defined by 0. Then rank(iil) G {1,2}. The eigenvalue of largest 
magnitude is X > k, and when rank(ill) = 2, the other nonzero eigenvalue of E is negative. The 
eigenvectors corresponding to nonzero eigenvalues lie in the span of {la}a=i- 

Proof. Writing 




1(- + - im; = 


rin 


Ub 



we see that rank(i?) G {1, 2}, and it is easy to calculate El = Nk and Tr(£') = k. Observe that 


A = sup x~^Ex > —1^El = /c, 
ll^l|2 = l 


and combining with rank(i?) < 2 and Tr(£') = k then implies that the other nonzero eigenvalue (if 
there is one) is negative. Finally, any eigenvector of E with a nonzero eigenvalue necessarily lies in 
the column space of E, which is a subspace of span{la}^^;^ by the definition of E. □ 

When finding z and B such that Q = z{I — E) + M — B y 0, it will be useful that I — E has 
only one negative eigenvalue to correct. Let v denote the corresponding eigenvector. Then we will 
pick B so that v is also an eigenvector of M — B. Since we want Q E 0 for as many instances 
of D as possible, we will then pick z as large as possible, thereby sending v to the nullspace of 
Q. Unfortunately, the authors found that this constraint fails to uniquely determine B in general. 
Instead, we impose a stronger constraint: 


Qla = 0 VaG {!,..., A:}. 

(This constraint implies Qv = 0 by Lemma[^) To see the implications of this constraint, note that 
we already necessarily have 

(Qla)a = [iz{I -E)+M- B)la^ = z{I - = z{l - = 0, 

and so it remains to impose 

0 = {Qlb)a = {{z{I -E) + M- B)lb)^ 


2na 

In order for there to exist a vector B^°‘’^'^l > 0 that satisfies (©, z must satisfy 

^na±nb ^ 

2,Tln 


( 9 ) 




and since z is independent of (a, b), we conclude that 


0,71 

z< min - - —min(M^“’^H). 

a,be{l,...,k} Ua + Uh 
a^b 


(10) 


Again, in order to ensure z{I — E) + M — B^O for as many instances of D as possible, we intend 
to choose z as large as possible. Luckily, there is a choice of B which satisfies Q for every (a, 6), 
even when 2 ; satisfies equality in (10). Indeed, we define 


u^a,b) :=M^^^^h-z 


ng + Ub 
2n„, 


1, P(a,b) ■= := 


P{b,a) 


^(a,b)ulb,a) ( 11 ) 


for every a,b G {1,... ,A;} with a ^ b. Then by design, B immediately satisfies (©• Also, note 
that P(a,fe) = P(b,a)i so meaning B is symmetric. Finally, we necessarily 

have > 0 (and thns p(^a,b) > 0) by ([^, and we implicitly reqnire P(a,b) > 0 lo^ division to be 
permissible. As snch, we also have > 0, as desired. 

Now that we have selected 2 ; and B, it remains to check that Q ^ 0. By constrnction, we 

already have A := spanjla}^^^ in the nnllspace of Q, and so it suffices to ensure 

0 ^ QP\z = P\± {^{,1 ~ E) + Af — P— zPa-^ T Pa^ (Af ~ P)Pa-^ ■ 

Here, Pj<^± denotes the orthogonal projection onto the orthogonal complement of A. Rearranging 
then gives the following result; 

Theorem 7. Take X := X)t=i ^lil7> where nt denotes the number of points in cluster t. Consider 
M defined by ([^, pick z so as to satisfy equality in (10), take B defined by ( [IT| ), and let A denote 

the span of Then X is a solution to the semidefinite relaxation (i if 


Pj^±iB - M)P^± ^ zPj^z. 


( 12 ) 


The next snbsection leverages this sufficient condition to establish that the Peng-Wei SDP 
is typically tight under the stochastic ball model. 


2.3 Integrality of the relaxation under the stochastic ball model 


We first note that our sufficient condition (12) is implied by 


\\Pa±MPa±\\2^2 + \\Pa^PPa^\\2^2 < 2 ^- 

By fnrther analyzing the left-hand side above (see Appendix [A]), we arrive at the following corollary: 

Corollary 8. Take X := X)t=i r^lilt"; where nt denotes the number of points in cluster t. Let T 
denote the mx N matrix whose {a,i)th column is Xg^i — Cg, where 

1 


Cn. •- 


n. 


E 




Xn 


denotes the empirical center of cluster a. Consider M defined by ([^, pick z so as to satisfy equality 
in (10), and take P[a^b) defined by ©• Then X is a solution to the semidefinite relaxation (U) if 


2IITI 


2 ||PixAf(“’^)l||2||Pi±M(Mi||2 

2^2 + 2 ^ 2 ^ - 
a=l b=a-\-l 


P{a,b) 
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In Appendix [B| we leverage the stochastic ball model to bound each term in Corollary and 
in doing so, we identify a regime in which the data points typically satisfy the sufficient condition 
given in Corollary]^ 

Theorem 9. The k-means semidefinite relaxation Q reeovers the planted clusters in the (2?,7, n)- 
stochastic ball model with probability 1 — provided A > 2 + 1? jm. 


We note that Theorem is an improvement to the main result of the authors’ preprint ^^ 
When k = Theorem is near-optimal, and in this sense, it’s a significant improvement 

over the sufficient condition 

2\/2M+ ^ 


A > 


(13) 


given in |1]. However, there are regimes (e.g., k = m) for which (13) is much better, leaving open 
the question of what the optimal bound is. Conjecture 4 in [3] suggests that A > 2 suffices for 
the A:-means SDP to recover planted clusters under the stochastic ball model, but as we illustrate 
below, this conjecture is surprisingly false. 

Consider the special case where m = 1, P is uniform on {±1}, and k = 2. Centering the 
two balls on ±A/2, then all of the points land in {±A/2 ± 1}. The /c-means-optimal clustering 
will partition the real line into two semi-infinite intervals, and so there are three possible ways of 
clustering these points. Suppose exactly A/4 of the points land in each of the 4 positions. Then by 
symmetry, there are only two ways to cluster: either we select the planted clusters, or we make the 
left-most location its own cluster. Interestingly, a little algebra reveals that this second alternative 
is strictly better in the /c-means sense provided A < 1 -|- \/3 ~ 2.7320. Also, in this regime, then as 
N gets large, the proportion of points in each position will be so close to 1/4 (with high probability) 
that this clustering will beat the planted clusters. 

Overall, when m = 1 and k = 2, then A > 1 -|- ^/3 is necessary for minimizing the /s-means 
objective to recover planted clusters for an arbitrary T. As a relaxation, the /c-means SDP recovers 
planted clusters only if minimizing the /c-means objective does so as well, and so it inherits this nec¬ 
essary condition, thereby disproving Conjecture 4 in |3j. Furthermore, as Figure j^left) illustrates, 
a similar counterexample is available in higher dimensions. 

To study when the SDP recovers the clusters, let’s continue with the case where m = 1 and 
k = 2. We know that minimizing /c-means will recover the clusters with high probability provided 
A > 1 -|- \/3. However, Theorem only guarantees that the SDP recovers the clusters when A > 6; 
in fact, (13) is slightly better here, giving that A > 5.6569 suffices. To shed light on the disparity. 
Figure [3 (center) illustrates the performance of the SDP for different values of A. Observe that the 
SDP is often tight when A is close to 2, but it doesn’t reliably recover the planted clusters until 
A > 4. We suspect that A = 4 is a phase transition for cluster recovery in this case. 

Qualitatively, the biggest difference between Theorem]^ and (13) is the dependence on k that 
Theorem [^exhibits. Figure [fright) illustrates that this comes from (12), meaning that one would 
need to use a completely different dual certificate in order to remove this dependence. 


3 A fast test for /c-means optimality 

In this section, we leverage the certificate ( |12| ) to test the optimality of a candidate /s-means 
solution. We first show how to solve a more general problem from linear algebra, and then we 
apply our solution to devise a fast test for /c-means optimality (as well as fast test for a related 
PCC algorithm). 
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Figure 3: (left) Take two unit disks in with centers on the i-axis at distance 2.08 apart. Let xo denote 

the smallest possible a;-coordinate in the disk on the right. For each disk, draw N/2 = 50,000 points uniformly at 
random from the perimeter. Given 9, cluster the points according to whether the x-coordinate is smaller than xo + 6. 
When 6 = 0, this clustering gives the planted clusters, and the A:-means objective (divided by N) is 1. We plot this 
normalized fc-means objective for 6 £ [0, 0.2]. Since N is large, this curve is very close to its expected shape, and we 
see that there are clusters whose fc-means value is smaller than that of the planted clustering, (center) Take two 
intervals of width 2 in R, and let A denote the distance between the midpoints of these intervals. For each interval, 
draw 10 points at random from its endpoints, and then run the fc-means SDP. For each A = 2 : 0.1 : 5, after running 
2, 000 trials of this experiment, we plot the proportion of trials for which the SDP relaxation was tight (dashed line) 
and the proportion of trials for which the SDP recovered the planted clusters (solid line). In this case, cluster recovery 
appears to exhibit a phase transition at A = 4. (right) For each A = 2 : 0.1 : 3 and fc = 2 : 2 : 20, consider the 
unit balls in R^° centered at where e; denotes the ith identity basis element. Draw 100 points uniformly 

from each ball, and test if the resulting data points satisfy ( |12[ ). After performing 10 trials of this experiment for 
each (A, fc), we shade the corresponding pixel according to the proportion of successful trials (white means every trial 
satisfied @). This plot indicates that our certificate (|12[) is to blame for Theorem [^s dependence on fc. 




3.1 Leading eigenvector hypothesis test 

This subsection concerns Problem To solve this problem, one might be inclined to apply the 
power method: 

Proposition 10 (Theorem 8.2.1 in [TO])- Let A G be a symmetric matrix with eigenvalues 

(counting multiplicities) satisfying 

I All > IA2I > • • • > I An I, 

and with corresponding orthonormal eigenvectors Pick a unit-norm vector qq G M” and 

consider the power iteration g’j+i := Aqj/\\Aqj\\ 2 . If qo is not orthogonal to v\, then 

( 1 ^ 7 dj? > 1 - [{vjqo)~^ - 1 ) • 

Notice that the above convergence guarantee depends on the quality of the initialization qQ. To 
use this guarantee, draw qQ at random from the unit sphere so that qo is not orthogonal to vi almost 
surely; one might then analyze the statistics of vjqo to produce statistics on the time required for 
convergence. The power method is typically used to find a leading eigenvector, but for our problem, 
we already have access to an eigenvector v, and we are tasked with determining whether v is the 
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Algorithm 1: Power iteration detector 


Input: Symmetric matrix A G unit eigenvector v G tolerance e > 0 

Output: Decision of whether to accept ffg or to reject Hq and accept Hi as given in (14) 
A •(— v'^ Av 

Draw q uniformly at random from the unit sphere in M"" 
while no decision has been made do 
if \q~^Aq\ > |A| then 
I Print accept Hq 
else if {v~^q)‘^ > 1 — e then 
I Print reject Hq and accept Hi 
end 

q G- Aq/\\Aq \\2 

end 


unique leading eigenvector. Intuitively, if you run the power method from a random initialization 
and it happens to converge to v, then this would have been a remarkable coincidence if v were not 
the unique leading eigenvector. Since we will only run finitely many iterations, how do we decide 
when we are sufficiently confident? The remainder of this subsection answers this question. 

Given a symmetric matrix A G and a unit eigenvector v of A, consider the hypotheses 

Hq : span(u) is not the unique leading eigenspace of A, 

Hi: span(u) is the unique leading eigenspace of A. 

To test these hypotheses, pick a tolerance e > 0 and run the power iteration detector (see Algo¬ 
rithm!^. This detector terminates either by accepting Hq or by rejecting Hq and accepting Hi. We 
say the detector fails to reject Hq if it either accepts Hq or fails to terminate. Before analyzing 
this detector, we consider the following definition: 

Definition 11. Given a symmetric matrix A G and unit eigenvector v of A, put A = v'^ Av, 

and let Ai denote a leading eigenvalue of A (i.e., |Ai| = ||A|| 2 ->. 2 j- IPe say (A,u) is degenerate if 

(a) the eigenvalue X of A has multiplicity > 2, 

(b) —A is an eigenvalue of A, or 


(c) — Ai is an eigenvalue of A. 

Theorem 12. Consider the power iteration detector (Algorithm [^, let qj denote q at the jth 
iteration (with qQ being the initialization), and let tTe denote the probability that {eJqQ)'^ < e. 

(i) {A,v) is degenerate only if Hq holds. If {A,v) is non-degenerate, then the power iteration 
detector terminates in finite time with probability 1. 

(ii) The power iteration detector incurs the following error rates: 


Pr 


^ reject Hq and accept Hi Hq ^ 


< vr. 


jau 10 rejeci 


■LJ. 




(Hi) If Hi holds, then 


min |j : {v~^qj)"^ > 1 — e| < 


31og(l/e) . 

21og(Ai/A2) 


with probability > 1 — vr^. 
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Proof. Denote the eigenvalues of A by (counting multiplicities), ordered in such a way that 

I All > • • • > I An I, and consider the corresponding orthonormal eigenvectors where v = Vp 

for some p. 

For (i), hrst note that Hi implies that {A,v) is non-degenerate, and so the contrapositive 
gives the first claim. Next, suppose {A,v) is non-degenerate. If Hi holds, then {v~^Qj)'^ —>■ 1 by 
Proposition [T0| provided qo is not orthogonal to v, and so the power iteration detector terminates 
with probability 1. Otherwise, Hq holds, and so the non-degeneracy of {A,v) implies that the 
eigenspace corresponding to Ai is the unique leading eigenspace of A, and furthermore, |Ai| > |A|. 
Following the proof of Theorem 8.2.1 in [lOj . we also have 

Putting r := min{i : |Ai| < |Ai|}, then 


\qjAq. - All ^ 


Er.lN9i)N’(V-Ai) 




< 


|Ai - A, 


<|Ai-A, 


2i 


i-||A,golli 

llA.golli 


A,. ^ 
■^1 , 


where P^i denotes the orthogonal projection onto the eigenspace corresponding to Ai. As such, 
IgjAgjl —>■ I All > |A| provided Px^qo / 0, and so the power iteration detector terminates with 
probability 1 . 

For (ii), we first consider the case of a false positive. Taking v = Vp for p 7 ^ 1, note that 
{v~^qj)"^ > 1 — e implies 


e > 1 - (v'^qj)^ = \\qj\\l - {vjqjf = '^{vjqj)^ > (vjqjf. 

i=l 

i^p 


Also, since ||Ax ||2 < |Ai|||x ||2 for all x G M”, we have that qjY monotonically increases with j; 

2 /X a2 \2 


K hh) = Vi 


T 


Mj 


\\Aq 


'ill2 




||Ag. 


'3 112 


IkilP 


As such, e > {vJqjY > {vJqoY- Overall, when Hq holds, the power iteration detector rejects 
Hq only if go is initialized poorly, i.e., (u^go)^ < e, which occurs with probability vTg (since go 
has a rotation-invariant probability distribution). For the false negative error rate, note that 
Proposition 10 gives that Hi implies convergence {v^qjY 1 provided go is not orthogonal to u, 
i.e., with probability 1 . 

For (iii), we want j such that {v^qjY > 1 — e. By Proposition 
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it suffices to have 


((G*)--i)(h) 


2i 


< e. 


In the event that (u^ qoY ^ ^ (which has probability 1 — vr^), it further suffices to have 


-2 f ^2 


2i 


E' 


Taking logs and rearranging then gives the result. 


□ 
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To estimate e and vr^, first note that go has a rotation-invariant probability distribntion, and so 
linearity of expectation gives 


1 . ^ 1 1 

= -^E[(e7go)^] = -Elkolli = -• 

2=1 

Thus, in order to make tTe small, we should expect to have e <C 1/n. The following lemma gives 
that such choices of e suffice for tt^ to be small: 

Lemma 13. If e> then ir^ < 3-y/ne. 

Proof. First, observe that (e^go)^ is equal in distribution to jQ, where Z has standard normal 
distribution and Q has chi-squared distribution with n degrees of freedom (Z and Q are indepen¬ 
dent). The probability density function of Z has a maximal value of at zero, and so 


Pr (^Z^ < a) < 

Also, Lemma 1 in |14) gives 

Pr > n -|- 2y/nx -|- 2x^ < e~^ Vx > 0. 

Therefore, picking a = 5ne and x = n, the union bound gives 

Pr ^(e7go)^ < = Pr < Pr (^Z^ < 5ree^ -|- Pr > 5n^ < -|- e“” < 3\/n€. □ 

Overall, if we take e = for c > 0, then if Hq is true, our detector will produce a false 

positive with probability 0{n~^). On the other hand, if Hi is true, then with probability 1 — 
our detector will reject Hq after Os{clogn) power iterations, provided IA 2 I < (1 — (^)|Ai|. 



3.2 Testing optimality with the power iteration detector 

In this subsection, we leverage the power iteration detector to test A:-means optimality. Note that 
the sufficient condition (12) holds if and only if n := is a leading eigenvector of the matrix 


+ = + P^^iB - D)Pji^. 

(The second equality follows from distributing the Pj ^±’s and recalling the definition of M in ([i 
As such, it suffices that {A,v) satisfy Hi in (14). Overall, given a collection of points C 


(15) 


and a proposed partition Ai U • • • U A*, = {1,..., A"}, we can produce the corresponding matrix A 


(defined above) and then rnn the power iteration detector of the previous subsection to test (12). In 


particnlar, a positive test with tolerance e will yield > 1—vTe confidence that the proposed partition is 
optimal under the fe-means objective. Furthermore, as we detail below, the matrix-vector products 
computed in the power iteration detector have a compntationally cheap implementation. 

Given an m x matrix $0 = [xa,i ■ ■ ■ Xa^ua] for each a G {1,..., k}, we follow the following 
procednre to implement the corresponding function x 1 —)• Ax as dehned in (lill): 

1. Compute Va G such that {va)i = for every a G {1,..., A;} in 0{mN) operations. 

Let u G denote the vector whose ath block is Un. 


14 







2 . 

3. 

4. 

5. 

6 . 

7. 

8 . 

9. 


Define the function (a, b, x) i—)• such that = t'al"'" — + li'J■ 

Running this function costs 0{m{na + rib)) operations. 

Define the function x i—)• Dx such that D = , where 4> = [<f>i • • • 

Running this function costs 0{mN) operations. 

Compute for every a G {1,..., /c} in 0{mN) operations. 

Define the function (a, b, x) i—)■ such that + /ral"*" + 

Running this function costs 0(m(na + rib)) operations. 


Compute z = mina^ft 


2na 


na+rih 


mm 


in(M(“’^)l) in 0{kmN) operations. 


Compute n(a,fe) = ^ every o, 6 G {1,..., A;}, a / 6 in 0{kmN) operations. 

Compute P(a,b) = '^Ja b)^ every a, 6 G {1,..., k}, a / 6 in 0{kN) operations. 

Define the function x i—)• Bx such that the ath block of the output is given by 

{Bx)a = 2^- 


6=1 

b^a 


P(b,a) 


Running this function costs 0{kmN) operations. 

10. Define the function x i—)• Pf^xx such that Py\^± = / — X]a=i 
Running this function costs 0{N) operations. 

11. Define the function x i—>■ Ax such that A = ;§^ll''^ + Pj<^±{B — D)Pj^j_. 
Running this function costs 0{kmN) operations. 


Overall, after 0{kmN) operations of preprocessing, one may compute the function x i—)> Ax for 
any given x in 0{kmN) operations. (Observe that this is the same complexity as each iteration of 
Lloyd’s algorithm, and as we illustrate in Figure]^ the runtimes are comparable.) 

At this point, we take a short aside to illustrate the utility of the power iteration detector 
beyond fc-means clustering. The original problem for which a PCC algorithm was developed was 
community recovery under the stochastic block model [5]. For this random graph, there are 
two communities of vertices, each of size n/2, and edges are drawn independently at random with 
probability p if the pair of vertices belong to the same community, and with probability < p if 
they come from different communities. Given the random edges, the maximum likelihood estimator 
for the communities is given by the vertex partition of two sets of size n/2 with the minimum cut. 
Given a partition of the vertices, let X denote the corresponding n x n matrix of ±ls such that 
Xij = 1 precisely when i and j belong to the same community. Given the adjacency matrix A of 
the random graph, one may express the cut of a partition X in terms of Tr(AX). Furthermore, 
X satisfies the convex constraints Xu = 1 and A ^ 0, and so one may relax to these constraints 
to obtain a semidefinite program and hope that the relaxation is typically tight over a large region 
of {p, q). Amazingly, this relaxation is typically tight precisely over the region of {p, q) for which 
community recovery is information-theoretically possible [I]. 

Given A, put B := 2A — ll"'" + I, and given a vector x G M”, define the corresponding n x n 
diagonal matrix Dx by {Dx)n := XiY^^=iBijXj. In [5], Bandeira observes that, given a partition 
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matrix X by some means (such as the fast algorithm provided in [2]), then X = xx~^ is SDP- 
optimal if both = 0 and the second smallest eigenvalue of Dx — B is strictly positive, meaning 
the partition gives the maximum likelihood estimator for the communities. However, as Bandeira 
notes, the computational bottleneck here is estimating the second smallest eigenvalue of Dx — B, 
and he suggests that a randomized power method-like algorithm might suffice, but leaves the 
investigation for future research. 

Here, we show how the power iteration detector fills this void in the theory. First, we note 
that in the interesting regime of {p, q), the number of nonzero entries in A is 0(n log n) with high 
probability [T]. As such, the function x i—>■ Bx can exploit this sparsity to take only 0(n log n) 
operations. This in turn allows for the computation of the diagonal of Dx to cost 0(n log n) 
operations. Next, note that 

\\Dx — H||2^.2 < ||.C)a;||2-s.2 + ||2A — ll'''||2-;.2 + ||F||2-s.2 

^ II || 2^2 T ||2A — 11~*~ ||^ + 1 = max | {^Dx)ii\ T u + 1 =: A, 

i 

and that A can be computed in 0(n) operations after computing the diagonal of Dx- Also, it takes 
0{n) operations to verify x~^l = 0. Assuming x~^l = 0, then the second smallest eigenvalue of 
Dx — B is strictly positive if and only if x spans the unique leading eigenspace of XI — Dx + B. 
Thus, one may test this condition using the power iteration detector, and furthermore, each iteration 
will take only 0(n log n) operations, thanks to the sparsity of A. 

4 A fast /c-means solver for two clusters 

The previous section illustrated how to quickly test whether a proposed solution to the fe-means 
problem is optimal. In particular, this test will be successful with high probability if the data 
follows the stochastic ball model with A > 2 + A;^/m. It remains to find a fast fc-means solver which 
also performs in this regime. 

In doing so, we maintain the philosophy that our algorithm should not “see” the stochastic ball 
model. Indeed, we view the stochastic ball model as a method of evaluating clustering algorithms 
rather than a realistic data model. For example, Lloyd’s algorithm can be viewed as an alternating 
minimization of the lifted objective function: 

k 

f{Ai,.. ., Afc,ci,.. • Tfc) ^ ^ \\xi - Cilp, Ai U • • • U = {1,.. .,N}, ci,... ,ca: G M”", 

t=l i£At 

and since this function is minimized at the fc-means optimizer (regardless of how the data is dis¬ 
tributed), such an algorithm is acceptable. On the other hand, one might consider matching the 
stochastic ball model to the data by maximizing the following function: 

N k 

g{ci ,... ,Cfc) := '^'^Pv{xi - ct), ci,... ,Cfc G M™', 
i=l t=l 

where pd(-) denotes the density function of V, which is supported on the unit ball centered at the 
origin. One could certainly devise a fast greedy method such as matching pursuit [l6] to optimize 
this objective function (especially if px> is smooth), but doing so violates our philosophy. 

In [20] . Peng and Wei showed that A:-means is equivalent to the following program: 

minimize Tv{DX) (16) 

subject to = a, A^ = A, Tr(A) = k, A1 = 1, A > 0 


16 


One may quickly observe that the SDP Q we analyzed in Section is a relaxation of this program. 
In this section, we follow Peng and Wei [20) by considering another relaxation of (16), obtained by 
discarding the X > 0 constraint (this is known as the spectral clustering relaxation [318]). We 
first denote the m x N matrix <I> = [xi • • • xat]. Without loss of generality, the data set is centered 
at the origin so that = 0. Letting v denote the N x 1 vector with Vi = \\xi 


li) then 


Dij — 


\Xi — Xn 


= ||xj||2 — 2x7 


+ Ikjlli = “ 2<I>'^<I> + 


)ir 


As such, D = v'\^ — 2<I>''^<h + , and so the constraints X = X~^ and A1 = 1 together imply an 

alternative expression for the objective function: 


Tr(DA) = Tt{vI^X - 2d>'^$A + X) 

= Triul^X^) - 2Tr(d>’^$A) + Tr(Alzy’^) 
= - 2Tr(d>’^$A). 


We conclude that minimizing Tr(Z)A) is equivalent to maximizing Tr(<I>''^<hA). 

Next, we observe that the feasible X in our relaxation are precisely the rank-Zc N xN orthogonal 
projection matrices satisfying A1 = 1. This in turn is equivalent to X having the form X = 
Tfll"'' + Y, where T is a rank-(A: — 1) x A^ orthogonal projection matrix satisfying Tl = 0. 
Discarding the Tl = 0 constraint produces the following relaxation of (16): 


maximize Tr(<I>''~$y) (17) 

subject to = y, y2 = y, Tr(y) = k-i 


For general values of k, this program amounts to finding k — 1 principal components of the data. 
Recalling our initial clustering goal, after finding the optimal Y, it remains to take X = 7^ll"*" + Y 
and then round to a nearby member of the feasibility region in (16). In m. Peng and Wei focus 
on the k = 2 case; they reduce the rounding step to a 2-means problem on the real line, and they 
establish an approximation ratio of 2 for this relax-and-round procedure. Here, we are concerned 
with exact recovery under the stochastic ball model, and as such, we slightly modify the rounding 
step. 

When k = 2, the solution to has the form Y = yy~^, where y is a leading unit eigenvector 
of <I>^<h. Our task is to find a matrix of the form 1 ^ 1^17 + j^l^lj with AUB = {1,... ,N} that 
is close to + yy~^■ To this end, it seems natural to consider 


Ag := {i:yi < 9}, 


Bg := Ag 


for some threshold 9. Since the data is centered (<hl = 0), one may be inclined to take 0 = 0, but 
this will be a poor choice if the true clusters have significantly different numbers of points. Instead, 
we select the 9 which minimizes the Zc-means objective of {Ag, Bg). Since we only need to consider 
A^ — 1 choices of 9, this is plausibly tractable, although computing the Zc-means objective once costs 
0{mN) operations, and so some care is necessary to keep the algorithm fast. 

We will show how to find the optimal (Ag, Bg) in 0{{m + logN)N) operations using a simple 
dynamic program. Order the indices so that yi < ■ ■ ■ < yN- Then the function to minimize is 




\X 'j X 4 f 


J'\\2 


+ 


1 


N 


N 


j=l j'=l 


N-i 


:E E 


Xo — Xnl 


3 ' 112 ■ 


j=i+l j'=i+l 
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Algorithm 2: Spectral fe-means clustering (for two clusters) 

Input: m X N matrix <1) = [xi • • • xat] of points to be clustered 
Output: Clusters A\J B = {1, ... ,N} 

Subtract centroid ^ from each column of $ to produce ^>o 

Compute leading eigenvector y of <l>o 

Find 0 that minimizes the A:-means objective of ({i : yi < 9}, {i : yi > 9}) 
{A, B) ^ ({i : yi < 9], {i:yi> 9}) 


Expanding the square and distributing sums gives 

i i 

Vi+i = Vi + 2^ llxjlla - + 2i||xi+i||2, 

i=i i=i 

and the vfs satisfy a similar recursion rule. As such, one may iteratively compute the Ui’s and 
vf’s before computing the /(i)’s and then minimizing. Overall, the following procedure finds the 
optimal {Aq^Bq) in 0((m + log A) A) operations: 

1. Sort the entries yi < • • • < ?/Ar in 0(Alog A) operations. 

2. Iteratively compute 

i N i N 

•= S2ii) :='^\\xj\\l, := ^ \\xj\\l 

j=l i=*+l j=l j=i+l 

for every i G {1,..., A — 1} in 0{mN) operations. 

3. Compute ui = 0 and Uj+i = Vi + 2s2{i) — + 2i||xj+i Hi for every z G {1,..., A — 2} 

in 0{mN) operations. 

4. Compute = 0 and vf_i = vf+2s2{i)—4:xj s5(i)+2(A—i)||xj||| for every i G {A—1,..., 2} 
in 0{mN) operations. 

5. Compute f{i) = Vi/i + vf/{N — i) for every i G {1,..., A — 1} in 0{N) operations. 

6. Find i that minimizes f{i) and output {1, ■ ■ ■ ■,%} and {i + 1,..., A} in 0{N) operations. 

Note that in the special case where m = 1, the above method exactly solves the fc-means problem 
when /c = 2 in only 0(Alog A) operations, recovering the rounding step of Peng and Wei [20) . For 
comparison, |23j leverages more sophisticated dynamic programming for the m = 1 case, but k is 
arbitrary and the algorithm costs 0{kN‘^) operations. 

See Algorithm for a summary of our relax-and-round procedure. As a spectral method, 
this algorithm enjoys quasilinear computational complexity; see Figure for an illustration. In 
particular, when computing the leading eigenvector of 4 >q <ho, each matrix-vector multiply in the 
power method costs only 0{mN) operations. Furthermore, as the following result guarantees, this 
algorithm performs well under the stochastic ball model: 

Theorem 14. Let A* = A*(2?, fc) denote the smallest value for which A > A* implies that mini¬ 
mizing the k-means objective recovers planted clusters under the {'D,'y,n)-stochastic ball model with 
probability 1 — When k = 2, spectral k-means clustering (Algorithm^^ recovers planted 

clusters under the stochastie ball model with probability 1 — provided A > A*. 
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See Appendix!^ for the proof. The main idea is that the leading eigenvector of is biased 

towards the difference between the ball centers, and as the following lemma establishes, this bias 
encourages spectral /c-means clustering to separate the planted clusters: 


Lemma 15. Take two clusters contained in unit balls centered at 7 and —7 with II 7 II 2 > 1- If mini¬ 
mizing the k-means objective recovers these clusters, then spectral k-means clustering (Algorithm^ 
also recovers them, provided the leading eigenvector z of ^ 0^0 satisfies z\ > \\z\\ 2 - 


Proof. Write $0 = ^ ~ 1 P^t 0 := —pAz, and observe that y = d>Q z is a leading eigenvector of 

<hod>o. Then 

yi = {xi- nY z = xj z+ 6 (18) 


for every i. Next, if \^'z\ > \\z\\ 2 , then a simple trigonometric argument gives that the balls (and 
therefore the planted clusters) are separated by the hyperplane orthogonal to z. Combined with 
(18), we then have that the clusters can be identified according to whether yi < 6 or yi > 9. It 
therefore suffices to minimize the fc-means objective subject to partitions of this form (for arbitrary 
thresholds 9), as so spectral A:-means clustering succeeds. □ 


5 Discussion 


This paper discussed various facets of probably certifiably correct algorithms for /c-means clustering. 
There are still many questions that have yet to be answered: 


• Let A*{'D, k) denote the smallest value for which A > A* implies that minimizing the A:-means 
objective recovers planted clusters under the {V, 7 , n)-stochastic ball model with probability 
1 — What is A*? It was conjectured in [1] that A* = 2, but as we demonstrated in 

Subsection |2.3t this is not the case. 


Let Agpp(21,/c) denote the smallest value for which A > Ag^p implies that solving the k- 
means SDP recovers planted clusters under the ifD, 7 , n)- stoc hastic ball model with probability 
I — What is Agpp? Considering Subsection 2.3 and Figure [^center), we suspect 


the SDP exhibits a performance gap: Agpp > A*. 


Is there a single dual certificate for the A:-means SDP that typically certifies planted clus¬ 
ters under the stochastic ball model whenever A > Agpp? Does this certification have a 
quasilinear-time implementation similar to Subsection 3.2? 


• Is there a quasilinear-time fc-means solver that typically solves A:-means under the stochastic 
ball model whenever A > A*? In particular, is there a quasilinear-time initialization of 
Lloyd’s algorithm that meets this specification? Following the philosophy of Section such 
algorithms should be designed so as to not “see” the stochastic ball model. 
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A Proof of Corollary 


It suffices to have 


I112-5.2 + IJRa-L-SPa^ Il2-s>2 < 


(19) 

We will bound the terms in (19) separately and then combine the bounds to derive a sufficient 
condition for Theorem]^ To bound the first term in (19), let u be the iV x 1 vector whose (a,i)th 
entry is llxa,ill 2 ; and let ‘h be the mx N matrix whose (a,f)th column is Xa^i- Then 


7^{a,i),{b,j) — \\^a,i ^b,ill2 — Il®a,*ll2 + Il®6,ill2 — -|- ll^ )j 

meaning D = — 2<1>^4> -|- . With this, we appeal to the blockwise definition of M ([^: 

II 2 -S .2 = I| 2^-2 = ||-fA-L(2^l''' “ 24>'''4> -I- )Pf^i_ ll2-5>2 

= 2\\P^±^'^ ^P^±\\2^2 = 2\\^Pa^\\1^2 = 2||^'||L2- 


For the second term in (|19|), we first write the decomposition 


B = 


k k 

E E 

a=l 6=a+l 


where H^g^ b) • —)■ produces a matrix whose (a, 5)th block is the input matrix, and is 

otherwise zero. Then 

k k 

PAxi?PAX=E E 

a=l b=a-\-l 

k k 

a=l 6=a+l 
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and so the triangle inequality gives 

k k 

\\P^±BPj^±\\ 2^2 < E E 

a=l b=a-\-l 

= E E 

a=l b=a+l 


where the last equality can be verihed by considering the spectrum of the square: 

(P(,,,)(AxP(“’')Pix) + P(b,,)(PixP('’“)Pix))' 

= P(„,,)((PlxP(“’')Pix)(PixP(“’')Pix)T) +P(,,,)((PlxP(“’')Pix)T(PlxP(“’')Pix)). 
At this point, we use the definition of B 0 to get 




li || 2->2 


l|-Pl-m(ffl,&)l|2||A^^(b,a)l|2 

P{a,b) 


Recalling the definition of U(^a,b) (p^ and combining these estimates then produces the result. 


B Proof Theorem [9] 

In this section, we apply the certificate from Corollary to the (P, 7 , n)-stochastic ball model (see 
Definition]^ to prove our main result. We will prove Theoremwith the help of several lemmas. 

Lemma 16. Denote 


1 


n 


C-a ■— 

i=l 


^ab ■— 11'7a T&II 2 ; Oab • — 


7a + 7b 


Then the {TD^^^n)-stochastic ball model satisfies the following estimates: 


n 


Z =1 


n 


E 


2 = 1 


Xa,^ := 


- 7a 5 

< e 

w.p. 


( 20 ) 

Elklli 

< e 

w.p. 

1 _ 

( 21 ) 

OabWl 

< e 

w.p. 

1 _ 

( 22 ) 

y one may lift 



0 

rT. 1 

a ,2 





^ a ,2 0 


and apply the Matrix Hoeffding inequality [2T] to conclude that 


Pr 




2=1 


>7 


Taking t := en then gives (20). For (21) and (22), notice that the random variables in each 


sum are iid and confined to an interval almost surely, and so the result follows from Hoeffding’s 
inequality. □ 
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Lemma 17. Under the {'D,j,n)-stochastic ball model, we have = 4np + q, where 


Pi ■= - Oah) + ^ 


Qi ■ 277.(X(i,i Oab) f (Ca Cfo) (Tq 'lb) ) ^ ^ Oa6||2 ^ ^ II^QJ C^a6||2 ) 

V / / 


and |(?i| < (6 + 2Aab)ne with prohahility 1 — e 

Proof. Add and subtract Oab and then expand the squares to get 


I Il2 II ||2 

\i^a,i ■^b,j\\2 / j ^a,jll2 


= n - 


- re - 


eT(^(a,b)i_^(a,a)i) 

i=i i=i 

r 2{Xa,i Oab) ipb Oab) P ~ ^ ^ ll^bj ^a6|l2 j 

\ j=l 

{ 2{Xa,i Oab) (Ca Oab) + “ ^ ^ ll^aj ^06112^ 

^ i=i ^ 

( n n \ 

^ ^ ll^fej 0^^112 ^ ^ ll^aj C^a6|l2 j ' 

i=i i=i ^ 

Add and subtract 7 a — 7 b to Cq — c;, and distribute over the resulting sum to obtain 
ej = 2n{xa,^ - Oab)^( 7 a -lb)+q 

= ^n(ra,i + i'Ja - Oab)^ i^a - Oab) + Q- 

Distributing and identifying ||7a — OabII 2 = ^ab/^ explains the definition of p. To show |gj| < 
(6 + 2Aab)ree, apply triangle and Cauchy-Schwarz to obtain 


\Qi\ < 


( \ IL fL 

(Ca Cb) (7a 7b) j T ^ ^ ll^f'ii Oab||2 ^ ^ ll^fij Oab|| 

^ i=i i=i 

^ 2 re ( ||raji ||2 T I| 7 a Oa,b||2 j f ||Ca 7 a II2 T ||cb 7b||2 j T ^ ^ ll^f'J Oab||2 ^ ^ ll^aj Oab||2 

^ ^ ^ ^ 7 = 1 7 = 1 


A 


< 2re( 1 + ) ( ||Ca - 7a||2 + ||cb “ 7b||2 ) + 


y ll^fcj Oab ||2 ^ ll^gj Oab ||2 

i=i i=i 


To finish the argument, apply (20) to the first term while adding and subtracting 

IE||r + 7a - Oablli = IE||r + 7b - OabWl, 


from the second and apply (22). 

Lemma 18. Under the (T),^,n)-stochastic ball model, we have 
1 


□ 


re 


-iT^Ca’a)! _ 2reE||r|| 


< 4ree w.p. 1 — e ^ 
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Proof. Add and subtract 7 a and expand the square to get 


= - V 

n n ^ 


i=i 


--fa) + 

’ n 


j=i 


The triangle and Cauchy-Schwarz inequalities then give 
- 1 Td(“’“)i - 2nE\\r\\l 


n 


n , n X 

X] ( “ 2r^i(ca - 7a) + - X] Il^“.ill 2 ) - 2nE||r||| 

i=i ^ ” i=i 2 


< n 


< n 


n 


n 


\\ra,i\\l - E\\r\\l +2Y l^liica “ 7a)| + n - ^ \\ra,j\\l - E\\r\\l 


i=l 

n 


2=1 

n 


j=l 


Y Il’’“-*ll 2 -IE||r||i + 2 ^ ||ca - -fah + n - Y \K,j\\l - E\\r 


2=1 


2=1 


i=i 


< 4ne, 


where the last step occurs with probability 1 —e by a union bound over ( 21 ) and ( 20 ). □ 

Lemma 19. Under the {'D,'y,n)-stochastic ball model, we have 

lT^(a,6)^ _ ^T^(a,a);^ > ^2^2^ _ y;.p. 1 _ _ 


Proof. Lemma 17 gives 

lT^(a, 6 )^ _ ;^T^(a,a)^ ^ {Anp + q) 


> 4n ^ f r^i( 7 a - Oab) + ^ ) - (6 + 2/\ab)n^e 

i=l ^ 2 


> 4n( n(ca - 7a)"^(7a - Oab) + ) - (6 + 2Aab)n^e. 


Cauchy-Schwarz along with (20) then gives the result. 

Lemma 20. Under the {T>,-f,n)-stochastic ball model, there exists C = C{'f) such that 
min min(M*'“’^^l) > nA(A — 2) + Cne w.p. i — ^ 

a^b 

where A := min A„!,. 

a,bG{l,...,fc} 

a^b 

Proof. Fix a and b. Then by Lemma 17 the following holds with probability 1 — e~^^’‘^ab:‘0). 

min > 4n .^min^^ (^rj/7a - Oab) + - (6 + 2Aa6)ne 

> nAlf, - 2nAab - (6 + 2Aab)ne, 


□ 
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where the last step is by Cauchy-Schwarz. Taking a union bound with Lemma 18 then gives 
min(M(“’^h) 


= mm 


+ 


n 


- 2 nE||r| 


> min ^ ^ _ 2nE\\r\\l 

> nAabi^ab - 2) - (10 + 2Aab)ne 

with probability 1 — The result then follows from a union bound over a and h. 


□ 


Lemma 21. Suppose e < 1. Then there exists C = C{Aab,m) such that under the ( 21 , 7 , n)- 
stochastic ball model, we have 

+ Cn^e 


m 


with probability 1 — e . 

Proof. First, a quick calculation reveals 


T^(a,b)i ^ ^ f ^ iT jjib,b)-^^ 

2\n 


n 


n n 2 \n n ) 

from which it follows that 

n 

= (^ejD^^’’’h - 
= e7A-L(-D^“’^^l - 

As such, we have 

= - p(“’“)i||2 - ||Pi(p(“’^)i - 

To bound the first term, we apply the triangle inequality over Lemma [T7| 

pK6)i _ £,K“)i ||2 < 4n\\p\\2 + \\qh < 4n||p||2 + (6 + 2AabW/\. 


(23) 


(24) 


We proceed by bounding ||p|| 2 - To this end, note that the pfs are iid random variables whose out¬ 
comes lie in a finite interval (of width determined by Aab) with probability 1. As such, Hoeffding’s 
inequality gives 




2=1 


< e 


w.p. 


1 _ 


25 










With this, we then have 


WpWI = - Epl + Epf] < nEpl + ne 

i=i ^ 


(25) 


in the same event. To determine Ep\, first take ri := e^r. Then since the distribution of r is 
rotation invariant, we may write 


Aab 


A? 


Pi = ra,l{la - Oab) + Wla - Oabh = 

where the second equality above is equality in distribution. We then have 

Epl = E(—n + ^] =^Erl + ^. 

\ 2 4. J 4 ^16 

We also note that 1 > IE||r ||2 = mEr^ by linearity of expectation, and so 


(26) 


Erf < —. 

m 


Combining (24), (25), (26) and (27) then gives 


/4r)3A2 \ V2 

||T>(“’^)1-D(“’“)i|| 2 < ( ^ +n^A^fc + 16n^ej + {6 + 2AahW/h. 


To bound the second term of (^^, first note that 


|Pl(T»(“’'’h - T)(“’“h)||2 = ^ _ lT£)(a.a)^ 

\/n 


Lemma 19 then gives 

lT^(a,6)i _ ;^T^(a,a)^ > lT^(a,6)^ _ j^{a,a) ^ y ^2^2^ - (6 + 4Aab)r?e 


(27) 


(28) 


(29) 


(30) 


with probability 1 — e Using (23) to combine (28) with (29) and (30) then gives the 

result. □ 

Lemma 22. There exists C = C{'y) such that under the (T>, ^, n)-stochastic ball model, we have 


P(a,b) > n^{^lb - - 2)) - Cn^ 


w.p. 


\ _ g-llE>,7,e(”-) 


Proof. Recall from (0 that 


Pia,b) = «(a,6)l = -nz = - n min min(M(“’^h). 

a^b 


(31) 


To bound the first term, we leverage Lemma 19 






> n^A^f, - (6 + 4Aaf,)n^e 
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with probability 1 — e To bound the second term in (31), note from Lemma 

min(M(“’^h) 


18 


that 


= mm 

< min 

< min 


^ ^ jj{a,a)-^ _ 2nE||r||| 

_ dMi'^ +4ne 


n 


_lT2)(fe,6)l _ 2nE||r| 


with probability 1 — e Next, Lemma 


17 


gives 


mm 


_£)(“-“)I'j < +(6 + 2Aafe)ne + 4n min rlii^la - Oab)- 

V / ie{l,...,n} 


By assumption, we know ||r ||2 > 1 — e with positive probability regardless of e > 0. It then follows 
that 

r^ila - Oab) < + e 

with some (e-dependent) positive probability. As such, we may conclude that 


. min rliiria - Oab) <-^ + e 


w.p. 


\ _ g —^'D,e(’T') 


Combining these estimates then gives 

min(A/i“’^il) < — 2nAab + (10 + 2Aab)ne w.p. 1 — _ 

Performing a union bound over a and h then gives 

min min(Mi“’^il) < nA^ — 2nA + (10 + 2A)ne w.p. 1 — ^ 

a^b 

Combining these estimates then gives the result. 

Lemma 23. Under the {'D,’y,n)-stochastic ball model, we have 

11 ^ 112^2 e^\/]V w.p. I - e-^m,k,aAO^ 

\ ym J 

where := E||r ||2 for r ^ T). 

Proof. Let R denote the matrix whose (a, i)th column is r^^j. Then 


□ 


= R- 

and so the triangle inequality gives 


(ci - 7i)l 


(cfe -7fc)l 


1^1 


2^-2 


< ||.R||2^-2 + (ci — 7l)l 


T 


(Cfc - 7A:)1 


T 


< 


2-s>2 


\\R\\2^2+ [n^\\Ca--ia\\l 


a=l 


1/2 


where the last estimate passes to the Frobenius norm. For the first term, since P is rotation 
invariant, we may apply Theorem 5.41 in 


□ 


Pl| 2^2 < (l + e)ai/- w.p. 

V m 


For the second term, apply (20). The union bound then gives the result. 
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Proof of Theorem^ First, 
e > 0 such that 


we combine Lemmas 21 22 and 23: For every 5 > 0, there exists an 


2||'I' 


11^2 + X] X] 

a=l &=a+l 






P{a,b) 
k k 


An^A'^ijfm + Cn^t 


a=l b=a-\-l 


v?{Al^ - A(A - 2)) - Cii^e 


<n(— + -Y V _ ^ _ 


+ (^ 


with probability 1 — e Next, the uniform bound Aab > A implies 




< 


A 


A2, - A(A - 2) 1 - A(A - 2)/A2, “ 1 - A(A - 2)/A^ 2 

Combining this with (|32|) and considering Lemma |20l it then suffices to have 


2k 4 

-1- 

m m 



.|<A(A-2). 


(32) 


Rearranging then gives 


A > 2 + 


2k 

mA 


+ 


fc(fc- 1) 
m 


which is implied by the hypothesis since A > 2. 


□ 


C Proof of Theorem 14 


Put g = 7 /II 7 II 2 and let z have unit 2-norm. Since ||<1 >q 2;||2 > 
it suffices to show that the containment 


1^0 5II 2 ) then considering Lemma 


15 


5i := <^ u G 




ves 


m—1 


|4>n u| 


2 < ll‘J’(}fl'l|2 : S 2 


holds with probability 1 — To this end, we will first show that each u G S'! is also a 

member of S 2 with high probability, and then we will perform a union bound over an e-net of Si. 

We start by considering ||$''~u ||2 and ||<h''' 5 '|| 2 . Decompose Xi as either 7 -l-rj or —y-t-r^ depending 
on whether Xi belongs to the ball centered at 7 or — 7 . Let w with ||rc ||2 = 1 be arbitrary. Then 

{xjw)‘^ = ((±7 + rYw)"^ = (A'y'^w + rjw)‘^ = (7'''u;)^ ± 2 {Yw){rjw) + {rjw)"^, 
and so ^{xjw)"^ = -|-IE(e7?')^- Linearity of expectation then gives 

mxlgf - (x]vf] = ( 7 ^ 9 )= - (7^7)2 = ||7|7l - (77) > 1 - 
Since \ {xjg)‘^ — {xjv)‘^\ < 2(1 -|- A/ 2 )^ almost surely, we may apply Hoeffding’s inequality to get 


l‘^>^5lli- 


|$^u||^ = 


N / A \ 

^ [{xjgf - (xjvf^ >n(i- ^ j - s 

i=l ^ ^ 


w.p. 1 — e 


-QAis^/N) 


(33) 
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For a properly chosen t, rearranging gives that ||<l>''^z ;||2 < ||<h''^g'|| 2 . Instead, we will use (33) to prove 
the closely related inequality ||<hQ u ||2 < ||<I>q 5 || 2 . Letting fj, denote the centroid of the columns of 
<h, we know by (20) that ||/r ||2 < <5 with probability 1 — In this event, every w with 

lltclb = 1 satisfies 


‘hdrc||2 — ll^'w'lbl = |||(^ + /^l')''u^||2 — 


T\T„ 


= |||<h''~rc + l^''~rc||2 — ||$''~rc||2| < ||l/i''~rc||2 < VN6. 


(34) 


Furthermore, 


ll^d'w^lb = II Vwh < ll^'u^lb + \\ln~^w\\2 < + 1 + llAilb^, 

where the last inequality follows from Cauchy-Schwarz along with the fact that ||xj ||2 < A/2 + 1 
for every i. Taking a supremum over w then gives 


l^ho" ||2^-2 A \/)V I — + 1 + 


2 ) < + i + 


w.p. 1 — e 




(35) 


In (33), pick s = {N/2){1 — 4/A^) =: ci{A)N. Then taking a union bound with ( p4| ) gives 

(ll^d^lb - < ll^^^^lli < ||$^ 5 llici(A)A < (ll^dfflb + \/^<5)^ - ci(A)A 

with probability 1 — Expanding both sides and rearranging then gives 

11‘J’d^^lli < ll'^'d^lli + 2%/iV<5(||ckdu||2 + ||<hd<7||2) - ci(A)A 

'A 


A ll'hg 5112 — { ci(A) — 45( — + 1 + (5 


iV, 


C2(A) 


where the last step follows from (35). Thus, picking S = 5(A) sufficiently small ensures C 2 (A) > 0. 
Since C 2 (A)A < ||4>d5ll2 “ Il^d^il 2 = (ll^ds'lb + ||4>d^ll2)(ll^d9ll2 - ||4>d'*^ll2), we further have 


l^d^lb - ll^d^'lb > 


C2{A)N 


l^ds'lb + ||^([u||2 


> C3{A)VN, 


where the last inequality takes C 3 (A) := C2(A)/(A/2 + 1 + 5), following (35). 

At this point, we know that if u G 5i, then u G S '2 with probability 1 — It remains 

to perform a union bound over an e-net of S'! to conclude that Si C S 2 with high probability. To 
this end, pick e < C3(A)/(A/2 + 1 + 5), consider an e-net A// of Si, and suppose 


|‘I*dw||2 < ll'hd^lb — c3(A)-\/]v Vu g A4. 


(36) 


Then for every x G there exists u G A4 such that ||x — u ||2 < e, and so (35) gives 


||‘I’da :||2 < ||4>dII 2 -S. 2 1|3^ ~ 'i^lb + ll^o"'*^11 2 A \/)/V+ l + 5^e + ||<I>d5||2 ~ C 3 (A)\/]V < ||<hd5ll2) 

as desired. To measure the probability of the success event (36), a standard volume comparison 
argument establishes the existence of an e-net of size |A4| < (1 + 2/e)"*; see Lemma 5.2 in [22]. As 
such, the union bound gives that (36) occurs with probability 1 — 
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